DNA databases of an important tropical timber tree species Shorea leprosula (Dipterocarpaceae) for forensic timber identification

International timber trade communities are increasingly demanding that timber in the wood supply chain be sourced from sustainably harvested forests and certified plantations. This is to combat illegal logging activities to prevent further depletion of our precious forests worldwide. Hence, timber tracking tools are important to support law enforcement officials in ensuring only sustainably harvested timbers are traded in the market. In this study, we developed chloroplast DNA (cpDNA) and simple sequence repeat (SSR) databases as tracking tools for an important tropical timber tree species, Shorea leprosula from Peninsular Malaysia. A total of 1410 individual trees were sampled from 44 natural populations throughout Peninsular Malaysia. Four cpDNA regions were used to generate a cpDNA haplotype database, resulting in a haplotype map comprising 22 unique haplotypes derived from 28 informative intraspecific variable sites. This cpDNA database can be used to trace the origin of an unknown log at the regional level. Ten SSR loci were used to develop the SSR allele frequency database. Bayesian cluster analysis divided the 44 populations into two genetic clusters corresponding to Region A and Region B. Based on conservativeness evaluation of the SSR databases for individual identification, the coancestry coefficients (θ) were adjusted to 0.1900 and 0.1500 for Region A and B, respectively. These databases are useful tools to complement existing timber tracking systems in ensuring only legally sourced timbers are allowed to enter the wood supply chain.

near infrared spectroscopy) can provide more reliable forensic timber identification 8 . Each tool has its own strength and in combination they complement one another allowing authorities to overcome limitations of more traditional methods in species identification, geographic origin verification or linking illegal logs to the stumps of origin.
Genetic approaches have been used to determine the origin of wood samples from many important species, including Neobalanocarpus heimii 9,10 , Gonystylus bancanus 11 , Acer macrophyllum 12 , Cedrela odorata 13 and Chamaecyparis taiwanensis 14 . To develop timber tracking tools suitable for these species, researchers applied the principles of population genetics such as mutation, genetic drift, migration, adaptation, and speciation 8 . These methodologies utilise genetic material (genetic markers) common across groups of individuals to define populations for provenance testing or to define species for species identification 8 . During forensic timber identification, enforcement officers need to identify unknown samples at genus or species level correctly from the start, before further investigating geographic origin or individual identification. This process is commonly supported by wood anatomists through the examination of the internal structure of the specimen in comparison to reference materials 15 . In addition, such identifications can also be achieved using DNA barcoding technology based on nucleotide variation at specific gene regions 16 . One example of this is the CITES listed species with the genus Gonystylus which can be distinguished from other closely related species using a combination of genetic markers including internal transcribed spacer (ITS2), trnH-psbA intergenic spacer and trnL 11 . Once the particular species is identified, we can track the geographic region of origin using a population identification database developed from cpDNA 9,17 or single nucleotide polymorphisms (SNP) markers 13 . If the suspected log can be traced back to a particular geographic region, we can use an individual identification database to link the log to the original stump 11,12 . Subsequently, the confidence level of the match probability can be tested by calculation of random match probability between the log and stump.
The Forest Research Institute Malaysia (FRIM) has developed comprehensive DNA profiling databases for several important tropical timber species for timber tracking, namely N. heimii 10 , G. bancanus 11 , S. platyclados 18 , Intsia palembanica 19 and Aquilaria malaccensis 20 . As an extension, this study aimed to develop tracking tools for S. leprosula in the context of forensic identification. Specifically, we utilised cpDNA markers to develop a haplotype database and SSR markers to establish an allele frequency database for this important tropical timber species. We can use the cpDNA haplotype database to infer the geographic origin of an unknown sample to the regional level. Subsequently, by using the SSR allele frequency database, we can calculate the random match probability to support the strength of evidence in cases where the suspect log matches the tree stump. This gives a new impetus for higher acceptance of evidence by the judge, which will improve the success rate of prosecutions of illegal logging perpetrators.
SSR allele frequency databases were established according to Region A and B, and characterized to evaluate the relative usefulness of each SSR marker in forensic investigation. The distribution of allele frequencies for each locus is listed in Table S2 (Region A database) and Table S3 (Region B database). Forensic parameters are shown in Table 1, with a total of 143 alleles and 174 alleles detected in the Region A and B databases, respectively. The observed (H o ) and expected (H e ) heterozygosity ranged from 0.3570 to 0.8346 and 0.4375 to 0.8795, respectively for populations in the Region A database; and ranged from 0.3298 to 0.8356 and 0.3469 to 0.8793, respectively for populations in the Region B database. The power of discrimination (PD) for the SSR loci ranged from 0.601 to 0.972 and 0.554 to 0.975, in Region A and B databases, respectively. The most discriminating locus was Sle605 in both the Region A (PD = 0.972) and Region B (PD = 0.975) databases. Minimum allele frequency was adjusted for alleles falling below the thresholds of 0.0066 (Region A) and 0.0024 (Region B).
Deviations from HWE were detected in four of the SSR loci for Region A (SleT11, SleT15, SleT17 and Sle465) and six SSR loci in Region B (SleT01, SleT11, SleT15, SleT17, SleT29 and SleT31). We evaluated these loci in each population independently to rule out the possible presence of null alleles. There were four populations in Region A (GJerai, RTelui, GBongsu and Piah) where a single one locus deviated from HWE; whereas there were eight populations in Region B (Behrang, HGombak, SLalang, Angsi, Klau, USat, Jengka and Panti) with a single locus and a single population (GLedang) with two loci that deviated from HWE (  www.nature.com/scientificreports/ of completely random mating and zero migration required for HWE and LD are unlikely to be met, either in humans, animals or plants [21][22][23] . Mean self-assignment, the proportion of individuals correctly assigned back to their population, was 45.9% and ranged from 14.3% (Kenaboi) to 81.3% (CTongkat) between population (Table 2). At the regional level, correct assignment rate of individuals to their region of origin was higher, 87.4% for Region A and 90.0% for Region B, (average of 88.7%).
Conservativeness of the database. The coancestry coefficient (θ) for Peninsular Malaysia (0.0579) was higher than those of Region A (0.0454) and Region B (0.0500) ( Table 3). A total of 4.54% and 5.00% of the genetic variability was distributed among populations within Region A and Region B, respectively. In terms of inbreeding coefficient (f), the value for the Region A database (f = 0.0892) was highest, followed by Peninsular Malaysia (f = 0.0822) and Region B (f = 0.0666). All the θ and f values were significantly greater than zero, demonstrated by the 95% confidence intervals not overlapping with zero. Both of the θ and f values were used www.nature.com/scientificreports/ to calculate the conservativeness of each database by testing the cognate database (P origin ) against the regional database (P combined ). The databases were non-conservative at the calculated θ value. In order for both the Region databases (A and B) to be conservative, the value of θ was adjusted from 0.0454 to 0.1900 for Region A and from 0.0500 to 0.1500 for Region B. For the Region A database, the most common SSR profile frequency is 2.69 × 10 -7 or 1 in 3.72 million and the rarest profile frequency is 1.84 × 10 -14 or 1 in 54.3 trillion. For the Region B database, the most common SSR profile frequency is 1.06 × 10 -7 or 1 in 9.43 million and the rarest profile frequency is 4.03 × 10 -16 or 1 in 2.48 quadrillion.

Discussion
At the moment, the database is not accessible to the public. However, the public can contribute by reporting suspicious illegal logging activities to the relevant enforcement unit, so that actions can be taken and suspicious samples collected for forensic timber identification. In addressing forensic timber identification questions, we use a cpDNA haplotype database to investigate the geographic origin of a suspect log at a regional level. Subsequently, we use an SSR allele frequency database to narrow down the geographic origin to population level. After the population is identified, the state forest department's enforcement officials can verify in their system if the area was permitted for logging activities. If that area is a forest reserve where no logging permit is being issued, efforts to locate potential stumps belonging to the sampled log can be initiated. Once potential stumps are found, we can try to link the log to the potential stumps by comparing their SSR DNA profiles. A random match probability between the log and the potential stumps can be established by using the SSR allele frequency database. From the cpDNA haplotype database, haplotypes H1 and H2 were most prevalent in Peninsular Malaysia, with a frequency of 47.2% and 42.6%, respectively. The distribution of cpDNA haplotypes is overlaid by the division of populations in Region A and B as suggested by the STRU CTU RE analysis. For Region A, haplotype H2 was found in all the populations, either in all the samples (BPerangin and BEnggang) or part of the samples. The less common haplotypes, H3 (1.7%) and H4 (1.1%) were also found in Region A. Overall, we observed haplotype H2 dominates the populations in this region. Whereas for Region B, 78% of populations carried haplotype H1 in all the samples (Terenggun, Krau, Behrang, HLangat, Berembun, Kenaboi, Angsi, Triang, Pasoh, BSenggeh, Labis and ERompin), with the exception of some populations which exhibited part of their samples carried haplotype H1 (TNegara, AGading, Tekam, Jengka, Beserah, Ampang, Lentang, HGombak, SLalang, Lesong, GLedang, PPanjang and AHitam). In addition, the less common haplotype H5 (1.1%) is found solely in this region. As a whole, haplotype H1 dominates the populations in Region B. Those haplotype H2 found in the populations of Region B might be due to the retention of ancestral polymorphism by the maternally inherited cpDNA marker 24 . The remaining rare haplotypes, H6-H22, present in one or two individuals are endemic to certain populations, as shown in Fig. 1a.
Based on the cpDNA haplotypes, S. leprosula individuals from Peninsular Malaysia can be traced back to their geographical origin in either Region A or B. In forensic investigation, if the generated haplotype of an unknown log belongs to haplotype H3 or H4, we can postulate that it might have originated from Region A. Similarly, if haplotype H1 or H5 were detected, then Region B would be the most likely source of origin. However, based only on the cpDNA haplotype database, it is impossible to track an unknown log back to a specific population or forest reserve because forest reserve boundaries were defined according to political governance and thus may not necessarily reflect the distribution of natural populations of the species. It should be noted that some www.nature.com/scientificreports/ rare haplotypes might not be represented in the database, as it is impossible to collect all S. leprosula trees from every forest reserve in Peninsular Malaysia. We can include more sampling sites in the future to improve the comprehensiveness of the cpDNA haplotype database. Particularly, the inclusion of populations from other distributions such as Sumatra and Borneo could provide some insights on the evolutionary history and gene flow of the species due to isolation and separation by South China Sea between Peninsular Malaysia and Borneo as well as by Straits of Malacca between Peninsular Malaysia and Sumatra. Once the geographical origin at regional level is ascertained, an assignment test based on the SSR allele frequency database can be used to trace the samples origin to population level. In this study, we observed low assignment rates to origin populations, which may be due to the weak genetic structure (θ = 0.058) observed in this species. The value of θ shows that only 5.8% of genetic variability was found distributed among populations, thus suggesting high genetic similarity. This θ value (0.058) was higher than I. palembanica (0.026) 19 and S. platyclados (0.033) 18 but lower than G. bancanus (0.067) 11 , A. malaccensis (0.097) 20 and N. heimii (0.127) 10 . Previous study suggested that populations of S. leprosula sampled from Peninsular Malaysia were a continuous, connected forest in the past, particularly in the low inland forests 25 . Continuous distribution would promote gene flow among populations through the sharing of a common gene pool, as shown by the common haplotypes H1 and H2 observed in the cpDNA population database. The current mean assignment rate at the population level is 45.90%, which is lower than those seen in other tropical species such as G. bancanus (54.80%) 11 , I. palembanica (62.20%) 19 , S. platyclados (77.78%) 18 and A. malaccensis (92.09%) 20 . At the regional level, the mean assignment rate to region is 88.70%, which is higher than seen in I. palembanica (80.21%) 19 but lower than A. malaccensis (94.96%) 20 , S. platyclados (99.11%) 18 and G. bancanus (100%) 11 .
The identification of illegal logging sites can be achieved under two circumstances. Firstly, by utilising assignment tests based on the SSR allele frequency database to locate the original population for the suspected illegal log. Secondly, if the Forest Department has received report on illegal logging activities in a specified area. As such, with help from experienced foresters and local indigenous people who are familiar with their local forest area, it is possible to find and sample the potential stumps which potentially match the suspect log within the forest. Once potential stumps are found, a tissue sample can be collected for DNA testing following FRIM's standard operating protocol on DNA forensics for plant species identification and wood tracking 26 . If the suspect log shows a similar SSR profile to a particular stump, we can calculate a random match probability by using the SSR allele frequency database with corrected θ value. By considering both population substructuring and inbreeding coefficient, the adjusted θ value will increase the profile frequency but conversely, understating the weight of the DNA evidence against a defendant 27 , should the matter be brought before the legal system. Random match probability is the reciprocal of profile frequency (1/profile frequency), representing the estimated frequency at which a particular SSR profile would be expected to occur in a population 21 . This will help to determine the probability of a match between an unknown log and its potential origin stump. The possible profile frequency based on the 10 SSR loci ranges between the profile frequency of the most common genotype which would be the least powerful in terms of differentiating between two unrelated individuals 28 , and the rarest theoretical profile. Based on the Region A database, the possible SSR profile frequencies range from 2.69 × 10 -7 to 1.84 × 10 -14 , and for the Region B database, from 1.06 × 10 -7 to 4.03 × 10 -16 . With such low profile frequencies, we can rule out the possibility of a random match between the DNA profiles of any log and stump 21 .
In this study, we obtained cpDNA fragment and SSR loci using high quality samples such as inner bark or leaf tissue preserved in liquid nitrogen. However, many seized woods or logs are usually have been dried or processed in practice. Thus, this may pose a challenge to extract sufficient and good quality DNA from dry wood for subsequent DNA analysis. To close the DNA extraction gap in S. leprosula, our future study is to develop a suitable DNA extraction method for dry wood and processed sample. The extracted DNA is then tested by PCR amplification on both cpDNA and SSR markers utilized in the DNA databases.

Conclusions
We report on the development of cpDNA haplotype and SSR allele frequency databases for an important timber species, S. leprosula in Peninsular Malaysia. The cpDNA haplotype database enables the tracing of unknown log at the regional level. The SSR allele frequency database was validated for specificity and accuracy for the calculation of random match probability of an unknown log to a potential origin stump. This database along with the existing reference databases in other important forest timber species will serve as an impetus and increase the use of DNA technology in illegal logging investigations and verification of legality in wood supply chains.

Methods
Sample collection and DNA extraction. In this study, 1,410 S. leprosula wild samples representing 44 populations from the natural forests distributed throughout Peninsular Malaysia (Table 4) were collected. The sample collection was carried out with the permissions granted from the State Forest Departments (Kedah, Perak, Kelantan, Terengganu, Pahang, Selangor, Negeri Sembilan, Melaka and Johor), the Department of Wildlife and National Parks, Royal Belum State Park and Johor National Parks Corporation. The voucher specimen was identified by Ramli Ponyoh and deposited in FRIM herbarium centre (voucher number = A4363). Cambium or leaf tissues was collected from each sample and kept in liquid nitrogen during transportation from the field to laboratory. Total genomic DNA was extracted using the 2× cetyltrimethylammonium bromide (CTAB) 29 32 . We applied models of admixture with sampling locations included as prior population information. Correlated allele frequencies were applied with K values ranging from 1 to 10 for 10 repetitions. The optimal number of genetic clusters was estimated based on the Delta K method 33 of STRU CTU RE SELECTOR 34 . For the optimal K, data from the 10 independent runs of STRU CTU RE analyses were graphically represented using CLUMPAK 35 . To support the analysis of the population structure, a UPGMA dendrogram was constructed based on Nei's D A using POPTREE2 36 . 1000 bootstrap replicates were applied to determine the relative strength of the nodes. The populations of S. leprosula were divided into two regions, Region A and Region B based on the optimal value of K = 2 derived above. Subsequently, the SSR database was built for Region A and B, comprising 381 (12 populations) and 1029 (32 populations) individuals, respectively. Allele frequency for each locus was calculated using Microsatellite Toolkit 37 . The number of alleles per locus (A), observed (H o ) and expected heterozygosity (H e ), conformity to Hardy-Weinberg equilibrium (HWE) expectations and linkage disequilibrium (LD) between loci were assessed using Fisher's exact tests in Genetic Data Analysis (GDA) v1.1 38 . The p value for departure from HWE and LD was adjusted by Bonferroni correction 39 . Forensic parameters including polymorphic information content (PIC), matching probability (MP) and power of discrimination (PD) were assessed using Power-Stats v1.2 40 . Coancestry (θ) and inbreeding (f) coefficients for the combined database (Peninsular Malaysia) and regional database (Region A and B) were calculated with 1000 bootstrap replicates in GDA 41 . Self-assignment tests were used to evaluate the proportion of correctly assigned individuals to the designated population and region as implemented in GENECLASS2 42 .
The subpopulation-cum-inbreeding model was used to calculate the profile frequency by multiplying the frequency of each locus across all the loci 43 . The most common and rarest profile frequencies were calculated by considering an individual sample that is heterozygous at all loci possessing the two most common alleles and rarest alleles at each locus, respectively. The conservativeness of the database was estimated by calculating the full profile frequency of each individual using genotype frequencies derived from the cognate database (P origin ) against profile frequency of each individual using genotype frequencies derived from the regional database (P combined ). The relative difference between the databases (d) were defined as d = log 10 (P origin /P combined ). If the d value was negative, in the case of P origin was less than P combined , it suggests that the database is conservative 27 . For a non-conservative database, in the case of positive d value, a series of θ adjustments were applied to recalculate P combined until all samples present a negative d value.

Plant collection declaration.
We declare that all our experimental research and field sampling of plant material comply with local, national or international guidelines and legislation.

Data availability
Raw sequence information and SSR primer pairs have been deposited to NCBI; GenBank accession numbers are provided in Table S6.